INTERSPEECH.2014 - Speech Processing

Total: 73

#1 A unified approach for underdetermined blind signal separation and source activity detection by multichannel factorial hidden Markov models

Authors: Takuya Higuchi ; Hirofumi Takeda ; Tomohiko Nakamura ; Hirokazu Kameoka

This paper introduces a new model, the multichannel factorial hidden Markov model (MFHMM), for underdetermined blind signal separation (BSS). For monaural source separation, one successful approach applies non-negative matrix factorization (NMF) to the magnitude spectrogram of a mixture signal, interpreted as a non-negative matrix. Multichannel extensions of NMF, which allow spatial information to be used as an additional cue for source separation, have been proposed by several authors and have proven effective for underdetermined BSS. This approach assumes that an observed signal is a mixture of a limited number of source signals, each with a static power spectral density scaled by a time-varying amplitude. However, many real-world source signals are non-stationary, and their spectral densities vary much more richly over time. Moreover, many sources, including speech, tend to stay inactive for some time before switching to an active mode, implying that the total power of a source may depend on its underlying state. To characterize this non-stationary nature, this paper extends the multichannel NMF model by modeling the transitions of each source's spectral densities and total power with a hidden Markov model (HMM). By letting each HMM contain states corresponding to active and inactive modes, we show that voice activity detection and source separation can be solved simultaneously through parameter inference in the proposed model. Experiments showed that the proposed algorithm provided a 7.65 dB improvement over conventional multichannel NMF in terms of the signal-to-distortion ratio.
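
The monaural NMF decomposition that the abstract takes as its starting point can be illustrated with a short sketch using the standard KL-divergence multiplicative updates; the multichannel and factorial-HMM extensions of the paper are not shown, and the dimensions and random data below are stand-ins for a real magnitude spectrogram.

```python
# Minimal sketch of the monaural NMF baseline: factorize a magnitude
# spectrogram V (freq x time) as V ~ W @ H with KL-divergence multiplicative
# updates. Illustrative only; the paper's MFHMM extension is not implemented.
import numpy as np

def kl_nmf(V, n_bases=10, n_iter=200, eps=1e-10):
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, n_bases)) + eps   # spectral bases
    H = rng.random((n_bases, T)) + eps   # time-varying activations
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

if __name__ == "__main__":
    V = np.abs(np.random.randn(257, 100))  # stand-in magnitude spectrogram
    W, H = kl_nmf(V)
    print("reconstruction error:", np.linalg.norm(V - W @ H))
```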

#2 Enhancing audio source separability using spectro-temporal regularization with NMF

Authors: Colin Vaz ; Dimitrios Dimitriadis ; Shrikanth S. Narayanan

We propose a spectro-temporal regularization approach for NMF that accounts for a source's spectral variability over time. The regularization terms allow NMF to adapt the spectral basis matrices optimally, reducing the mismatch between the spectral characteristics of sources observed during training and those encountered during separation. We first tested our algorithm on a simulated source separation task. Preliminary results show significant improvements in SAR, SDR, and SIR values over some current NMF methods. We also tested our algorithm on a speech enhancement task and showed a modest improvement in the PESQ scores of the recovered speech.

#3 Blind speech source localization, counting and separation for 2-channel convolutive mixtures in a reverberant environment

Authors: Sayeh Mirzaei ; Hugo Van hamme ; Yaser Norouzi

In this paper, the tasks of speech source localization, source counting and source separation are addressed for an unknown number of sources in a stereo recording scenario. In the first stage, the angles of arrival of the individual source signals are estimated through a peak-finding scheme applied to an angular spectrum derived using non-linear GCC-PHAT. Then, given the channel mixing coefficients, we propose a Maximum Likelihood (ML) approach for separating the sources. The predominant source in each time-frequency bin is identified through ML estimation assuming a diffuse noise model. The separation performance is improved over a binary time-frequency masking method. Performance is measured using standard metrics for blind source separation evaluation. The experiments are performed on synthetic speech mixtures in both anechoic and reverberant environments.
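
The first-stage localization cue can be sketched with a basic two-channel GCC-PHAT estimate of the time difference of arrival (TDOA); the angular spectrum in the paper is built by mapping candidate directions to TDOAs and accumulating such GCC-PHAT values across frames. The delayed-noise signals and sampling rate below are toy assumptions.

```python
# Sketch of GCC-PHAT between two channels: whiten the cross-spectrum so that
# only phase (i.e. delay) information remains, then pick the lag of the peak.
import numpy as np

def gcc_phat(x, x_ref, fs):
    n = len(x) + len(x_ref)                    # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    X_ref = np.fft.rfft(x_ref, n=n)
    cross = X * np.conj(X_ref)
    cross /= np.abs(cross) + 1e-12             # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift]))   # reorder lags
    lags = np.arange(-max_shift, max_shift)
    return lags[np.argmax(np.abs(cc))] / fs    # peak lag -> TDOA in seconds

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(1)
    s = rng.standard_normal(fs)                # 1 s of noise as a stand-in source
    delay = 8                                  # inter-channel delay in samples
    print("estimated TDOA:", gcc_phat(np.roll(s, delay), s, fs),
          "true:", delay / fs)
```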

#4 Discriminative NMF and its application to single-channel source separation

Authors: Felix Weninger ; Jonathan Le Roux ; John R. Hershey ; Shinji Watanabe

The objective of single-channel source separation is to accurately recover source signals from mixtures. Non-negative matrix factorization (NMF) is a popular approach for this task, yet previous NMF approaches have not directly optimized this objective, despite some efforts in this direction. This paper introduces discriminative training of the NMF basis functions such that, given the coefficients obtained on a mixture, a desired source is optimally recovered. We approach this optimization by generalizing the model to have separate analysis and reconstruction basis functions. This generalization frees us to optimize reconstruction objectives that incorporate the filtering step and SNR performance criteria. A novel multiplicative update algorithm is presented for optimizing the reconstruction basis functions according to the proposed discriminative objective functions. Results on the 2nd CHiME Speech Separation and Recognition Challenge task indicate significant gains in source-to-distortion ratio with respect to sparse NMF, exemplar-based NMF, and a previously proposed discriminative NMF criterion.
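
For reference, the standard supervised NMF separation pipeline that the discriminative training targets can be sketched as follows: pre-trained speech and noise bases are concatenated, only the activations are inferred on the mixture, and the source is recovered with a Wiener-style mask. The paper's discriminative objective and separate analysis/reconstruction bases are not implemented here; all shapes and the random "trained" bases are assumptions.

```python
# Baseline supervised NMF separation with a Wiener-style soft mask.
import numpy as np

def update_activations(V, W, n_iter=100, eps=1e-10):
    # KL-divergence multiplicative updates with the bases W held fixed.
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    return H

def separate(V_mix, W_speech, W_noise):
    W = np.hstack([W_speech, W_noise])
    H = update_activations(V_mix, W)
    k = W_speech.shape[1]
    V_speech = W_speech @ H[:k]
    V_noise = W_noise @ H[k:]
    mask = V_speech / (V_speech + V_noise + 1e-10)   # Wiener-style soft mask
    return mask * V_mix                              # filtered speech magnitude

if __name__ == "__main__":
    F = 257
    W_speech = np.abs(np.random.randn(F, 20))        # stand-in trained bases
    W_noise = np.abs(np.random.randn(F, 10))
    V_mix = np.abs(np.random.randn(F, 50))           # stand-in mixture spectrogram
    print(separate(V_mix, W_speech, W_noise).shape)
```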

#5 Vocal tract length estimation based on vowels using a database consisting of 385 speakers and a database with MRI-based vocal tract shape information

Authors: Hideki Kawahara ; Tatsuya Kitamura ; Hironori Takemoto ; Ryuichi Nisimura ; Toshio Irino

A highly reproducible vocal tract length (VTL) estimation method and a text-independent VTL estimation method are proposed, based on a Japanese vowel database spoken by 385 male and female speakers ranging in age from 6 to 56 and on another vowel database with MRI-based vocal tract shape information. The proposed methods rely on an interference-free power spectral representation and on systematic suppression of biasing factors. The MRI data are used to calibrate the VTL estimates so that they are expressed in a physically meaningful unit. The databases are normalized using the estimated VTL information to provide a reference template, which is then used to implement the text-independent VTL estimation method. A prototype system for text-independent VTL estimation is implemented in Matlab and runs faster than real time on a PC.

#6 A graph-based Gaussian component clustering approach to unsupervised acoustic modeling

Authors: Haipeng Wang ; Tan Lee ; Cheung-Chi Leung ; Bin Ma ; Haizhou Li

This paper describes a new approach to unsupervised acoustic modeling, that is, building acoustic models for phoneme-like sub-word units from untranscribed speech data. The proposed approach is based on Gaussian component clustering. Initially, a large set of Gaussian components is estimated from the untranscribed data. Clustering is then performed to group these Gaussian components into different clusters, and each cluster of components forms an acoustic model for an induced sub-word unit. We define several similarity measures among the Gaussian components and investigate several graph-based clustering algorithms. Experiments on the TIMIT corpus demonstrate the effectiveness of our approach.
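
The clustering stage can be sketched as below, assuming diagonal-covariance Gaussians, a symmetric-KL-based affinity, and spectral clustering as one possible graph-based algorithm; the paper studies several similarity measures and clustering methods, so these specific choices are stand-ins.

```python
# Sketch: pairwise similarities between Gaussian components turned into a
# graph affinity, followed by graph-based (here spectral) clustering.
import numpy as np
from sklearn.cluster import SpectralClustering

def sym_kl_diag(mu1, var1, mu2, var2):
    # Symmetric KL divergence between two diagonal-covariance Gaussians.
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1)
    return kl12 + kl21

def cluster_gaussians(means, variances, n_clusters):
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = sym_kl_diag(means[i], variances[i],
                                                  means[j], variances[j])
    affinity = np.exp(-dist / np.median(dist[dist > 0]))   # graph edge weights
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(affinity)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = rng.standard_normal((30, 13))       # stand-in Gaussian parameters
    variances = rng.random((30, 13)) + 0.5
    print(cluster_gaussians(means, variances, n_clusters=5))
```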

#7 A speech system for estimating daily word counts

Authors: Ali Ziaei ; Abhijeet Sangwan ; John H. L. Hansen

The ability to count the number of words spoken by an individual over long durations is important to researchers investigating language development, healthcare, education, etc. In this study, we build a speech system that computes daily word counts using data from the Prof-Life-Log corpus. The task is challenging, as typical audio files from Prof-Life-Log are 8 to 16 hours long, with audio collected continuously using the LENA device. The device is worn by the primary speaker, capturing all of his daily interactions in detail. The recordings contain a wide variety of noise types with varying SNR (signal-to-noise ratio), including large crowds, babble, and competing secondary speakers. We develop a word-count estimation (WCE) system based on syllable detection, using the method proposed by Wang and Narayanan as the baseline system [1]. We propose several modifications to the original algorithm to improve its effectiveness in noise. In particular, we incorporate speech activity detection and enhancement techniques to remove non-speech segments from the analysis and to improve signal quality for better syllable detection, respectively. We also investigate features derived from syllable detection for better word count estimation. The proposed method shows significant improvement over the baseline.
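
The core syllable-counting idea behind word-count estimation can be sketched as below: a smoothed short-time energy envelope is peak-picked for syllable nuclei, and the word count is approximated by dividing by an assumed syllables-per-word rate. The thresholds, the 1.5 syllables/word figure, and the toy signal are illustrative assumptions, not the paper's settings.

```python
# Toy word-count estimation from energy-envelope syllable nuclei.
import numpy as np
from scipy.signal import find_peaks

def estimate_word_count(x, fs, frame_ms=25, hop_ms=10, syllables_per_word=1.5):
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    # Short-time energy envelope.
    energy = np.array([np.sum(x[i:i + frame] ** 2)
                       for i in range(0, len(x) - frame, hop)])
    # Light smoothing to merge spurious sub-peaks within a syllable.
    kernel = np.hanning(9); kernel /= kernel.sum()
    env = np.convolve(energy, kernel, mode="same")
    # Peaks well above the floor, at least ~150 ms apart = syllable nuclei.
    peaks, _ = find_peaks(env, height=0.1 * env.max(),
                          distance=int(0.15 * 1000 / hop_ms))
    return len(peaks) / syllables_per_word

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs * 2) / fs
    mod = np.sin(2 * np.pi * 2 * t).clip(min=0)   # 4 "syllable" bursts in 2 s
    x = mod * np.random.default_rng(0).standard_normal(len(t))
    print("estimated words:", estimate_word_count(x, fs))
```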

#8 Ensemble modeling of denoising autoencoder for speech spectrum restoration

Authors: Xugang Lu ; Yu Tsao ; Shigeki Matsuda ; Chiori Hori

The denoising autoencoder (DAE) is effective at restoring clean speech from noisy observations. In addition, it can easily be stacked into a deep denoising autoencoder (DDAE) architecture to further improve performance. Most studies assume that the DAE or DDAE can learn arbitrarily complex transform functions to approximate the relation between noisy and clean speech. However, given the large variation in speech patterns and noisy environments, the learned model lacks focus on local transformations. In this study, we propose an ensemble model of DAEs that learns both global and local transform functions. In the ensemble model, local transform functions are learned by several DAEs using data sets obtained from unsupervised data clustering and partitioning. The final transform function used for speech restoration is a combination of all the learned local transform functions. Speech denoising experiments were carried out to examine the performance of the proposed method. Experimental results showed that the proposed ensemble DAE model provided superior restoration accuracy compared with traditional DAE models.
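
A minimal single-DAE building block can be sketched in PyTorch as follows; the unsupervised clustering, data partitioning, and ensemble combination that the paper adds on top are not shown, and the frame dimension, network size, and synthetic data are assumptions.

```python
# Single-layer denoising autoencoder: learns a mapping from noisy to clean
# spectral frames by minimizing the reconstruction error against clean targets.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=257, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, noisy):
        return self.decoder(self.encoder(noisy))

if __name__ == "__main__":
    torch.manual_seed(0)
    clean = torch.randn(1000, 257)            # stand-in clean spectral frames
    noisy = clean + 0.3 * torch.randn_like(clean)
    model = DenoisingAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(model(noisy), clean)   # reconstruct clean from noisy
        loss.backward()
        opt.step()
    print("final training loss:", loss.item())
```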

#9 Single channel source separation with general stochastic networks

Authors: Matthias Zöhrer ; Franz Pernkopf

Single-channel source separation (SCSS) is ill-posed and thus challenging. In this paper, we apply general stochastic networks (GSNs), a deep neural network architecture, to SCSS. We extend GSNs to predict a time-frequency representation, i.e., a soft mask, by introducing a hybrid generative-discriminative training objective to the network. We evaluate GSNs on data from the 2nd CHiME speech separation challenge. In particular, we provide results for speaker-dependent, speaker-independent, matched-noise, and unmatched-noise tasks. Empirically, we compare against other deep architectures, namely a deep belief network (DBN) and a multi-layer perceptron (MLP). In general, deep architectures perform well on SCSS tasks.

#10 Large-margin conditional random fields for single-microphone speech separation

Authors: Yu Ting Yeung ; Tan Lee ; Cheung-Chi Leung

Conditional random field (CRF) formulations for single-microphone speech separation are improved by large-margin parameter estimation. Speech sources are represented by acoustic state sequences from speaker-dependent acoustic models. The large-margin technique improves the classification accuracy of acoustic states by reducing the generalization error in the training phase. Non-linear mappings inspired by the mixture-maximization (MIXMAX) model are applied to the speech mixture observations. Compared with a factorial hidden Markov model baseline, the improved CRF formulations achieve better separation performance with significantly less training data. Separation performance is evaluated in terms of objective speech quality measures and speech recognition accuracy on the reconstructed sources. Compared with CRF formulations without large-margin parameter estimation, the improved formulations achieve better performance without modifying the statistical inference procedures, especially when the sources are modeled with an increased number of acoustic states.

#11 On the use of the Watson mixture model for clustering-based under-determined blind source separation

Authors: Ingrid Jafari ; Roberto Togneri ; Sven Nordholm

In this paper, we investigate the application of a generative clustering technique to the estimation of time-frequency source separation masks. Recent advances in time-frequency clustering-based approaches to blind source separation have touched upon the Watson mixture model (WMM) as a tool for source separation. However, most methods have operated frequency bin-wise and have thus required an additional permutation alignment stage, and previous full-band methods that employ the WMM have yet to be applied to the under-determined setting. We evaluate the clustering ability of the WMM within the clustering-based source separation framework. Evaluations confirm the superiority of the WMM over other previously used clustering techniques such as fuzzy c-means.

#12 Binary mask estimation based on frequency modulations

Authors: Chung-Chien Hsu ; Jen-Tzung Chien ; Tai-Shih Chi

In this paper, a binary mask estimation algorithm is proposed based on modulations of speech. A multi-resolution spectro-temporal analytical auditory model is utilized to extract modulation features to estimate the binary mask, which is often used in speech segregation applications. The proposed method estimates noise from the beginning of each test sentence, a common approach seen in many conventional speech enhancement algorithms, to further enhance the modulation features. Experimental results demonstrate that the proposed method outperforms the AMS-GMM system in terms of the HIT-FA rate when estimating the binary mask.
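
The basic binary-mask idea, with noise estimated from the leading frames of an utterance as in conventional enhancement front ends, can be sketched as below; the 10-frame noise window and the 0 dB local-SNR criterion are illustrative assumptions, not the paper's modulation-domain method or its settings.

```python
# Toy binary mask: keep time-frequency bins whose local SNR, relative to a
# noise estimate taken from the first frames, exceeds a threshold.
import numpy as np

def binary_mask(mix_mag, n_noise_frames=10, snr_threshold_db=0.0):
    # Average the leading frames as a (crude) noise magnitude estimate.
    noise_est = mix_mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    local_snr_db = 20 * np.log10((mix_mag + 1e-10) / (noise_est + 1e-10))
    return (local_snr_db > snr_threshold_db).astype(float)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = 0.1 * np.abs(rng.standard_normal((257, 120)))
    speech = np.zeros((257, 120))
    speech[:, 30:90] = np.abs(rng.standard_normal((257, 60)))  # speech onset later
    mask = binary_mask(speech + noise)
    print("fraction of bins kept:", mask.mean())
```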

#13 Bayesian factorization and selection for speech and music separation

Authors: Po-Kai Yang ; Chung-Chien Hsu ; Jen-Tzung Chien

This paper proposes a new Bayesian nonnegative matrix factorization (NMF) for speech and music separation. We introduce a Poisson likelihood for the NMF approximation and exponential prior distributions for the factorized basis matrix and weight matrix. A variational Bayesian (VB) EM algorithm is developed to efficiently solve for the variational parameters and model parameters of the Bayesian NMF. Importantly, the exponential prior parameter is used to control the sparseness of the basis representation. The variational lower bound of the VB-EM procedure is derived as an objective for adaptive basis selection for different mixed signals. Experiments on single-channel speech/music separation show that the adaptive basis representation in Bayesian NMF via model selection performs better, in terms of signal-to-distortion ratio, than NMF with a fixed number of bases.

#14 Self-adaption in single-channel source separation

Authors: Michael Wohlmayr ; Ludwig Mohr ; Franz Pernkopf

Single-channel source separation (SCSS) usually uses pre-trained source-specific models to separate the sources. These models capture the characteristics of each source and perform well when they match the test conditions. In this paper, we extend the applicability of SCSS. We develop an EM-like iterative adaptation algorithm that is capable of adapting the pre-trained models to the changed characteristics of the specific situation, such as a different acoustic channel introduced by variation in the room acoustics or a changed speaker position. The adaptation framework requires only signal mixtures, i.e., isolated single-source signals are not necessary. We consider speech/noise mixtures and restrict the adaptation to the speech model. Model adaptation is empirically evaluated using mixture utterances from the CHiME 2 challenge. We perform experiments using speaker-dependent (SD) and speaker-independent (SI) models trained on clean or reverberated single-speaker utterances. We successfully adapt SI source models trained on clean utterances and achieve almost the same performance level as SD models trained on reverberated utterances.

#15 Dynamic stream weight estimation in coupled-HMM-based audio-visual speech recognition using multilayer perceptrons

Authors: Ahmed Hussen Abdelaziz ; Dorothea Kolossa

Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video streams to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic stream weight estimation for coupled-HMM-based audio-visual speech recognition. We investigate the multilayer perceptron (MLP) for mapping reliability-measure features to stream weights. As input to the multilayer perceptron, we use a feature vector containing different model-based and signal-based reliability measures. The multilayer perceptron is trained using dynamic oracle stream weights as target outputs, which are obtained with a recently proposed expectation maximization algorithm. This new approach to MLP-based stream-weight estimation has been evaluated on the Grid audio-visual corpus and outperformed the best baseline, yielding a 23.72% average relative error rate reduction.
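
The stream-weight mapping itself can be sketched as a small regression problem: an MLP maps a vector of reliability measures to an audio stream weight in [0, 1], with the video weight taken as its complement. The feature dimensionality, network size, and synthetic oracle targets below are assumptions; in the paper the oracle weights come from an EM procedure.

```python
# Sketch: regress oracle stream weights from reliability-measure features.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
reliability_feats = rng.standard_normal((5000, 12))          # e.g. SNR, dispersion, ...
oracle_weights = 1 / (1 + np.exp(-reliability_feats[:, 0]))  # stand-in targets in [0, 1]

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
mlp.fit(reliability_feats[:4000], oracle_weights[:4000])

audio_weight = np.clip(mlp.predict(reliability_feats[4000:]), 0.0, 1.0)
video_weight = 1.0 - audio_weight                            # complementary stream weight
print("mean predicted audio stream weight:", audio_weight.mean())
```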

#16 Lipreading using convolutional neural network

Authors: Kuniaki Noda ; Yuki Yamaguchi ; Kazuhiro Nakadai ; Hiroshi G. Okuno ; Tetsuya Ogata

In recent automatic speech recognition studies, deep learning architectures for acoustic modeling have eclipsed approaches based on conventional sound features such as Mel-frequency cepstral coefficients. For visual speech recognition (VSR), however, handcrafted visual feature extraction mechanisms are still widely used. In this paper, we propose applying a convolutional neural network (CNN) as the visual feature extraction mechanism for VSR. By training a CNN on images of a speaker's mouth area paired with phoneme labels, the CNN acquires multiple convolutional filters that extract visual features essential for recognizing phonemes. Further, by modeling the temporal dependencies of the generated phoneme label sequences, a hidden Markov model in the proposed system recognizes multiple isolated words. The system is evaluated on an audio-visual speech dataset comprising 300 Japanese words spoken by six different speakers. The results of our isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those obtained by conventional dimensionality compression approaches, including principal component analysis.
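
The visual front end described here can be sketched as a small CNN that maps a mouth-region image to phoneme logits; the input size, number of phoneme classes, and layer sizes below are illustrative assumptions rather than the paper's architecture.

```python
# Minimal CNN sketch for frame-wise phoneme classification from mouth crops.
import torch
import torch.nn as nn

class LipCNN(nn.Module):
    def __init__(self, n_phonemes=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_phonemes)

    def forward(self, x):                      # x: (batch, 1, 32, 32)
        h = self.features(x)
        return self.classifier(h.flatten(1))   # phoneme logits per frame

if __name__ == "__main__":
    torch.manual_seed(0)
    frames = torch.randn(8, 1, 32, 32)         # stand-in grayscale mouth crops
    logits = LipCNN()(frames)
    print(logits.shape)                        # (8, 40)
```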

#17 Lipreading approach for isolated digits recognition under whisper and neutral speech

Authors: Fei Tao ; Carlos Busso

Whisper is a speech production mode normally used to protect confidential information. Given the differences in the acoustic domain, the performance of automatic speech recognition (ASR) systems decreases for whisper speech. An appealing way to improve performance is lipreading. This study explores the use of visual features characterizing the lips' geometry and appearance to recognize digits under normal and whisper speech conditions using hidden Markov models (HMMs). We evaluate the proposed features on the digit portion of the audiovisual whisper (AVW) corpus. While the proposed system achieves high accuracy in speaker-dependent conditions (80.8%), the performance decreases for speaker-independent models (52.9%). We propose supervised adaptation schemes to reduce the mismatch between speakers. Across all conditions, the performance of the classifiers remains competitive even in the presence of whisper speech, highlighting the benefits of using visual features.

#18 Multimodal exemplar-based voice conversion using lip features in noisy environments

Authors: Kenta Masaka ; Ryo Aihara ; Tetsuya Takiguchi ; Yasuo Ariki

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous exemplar-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. The converted speech is then constructed from the target exemplars and the weights associated with the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our previous exemplar-based method. As visual features, we use not only the conventional DCT but also features extracted from an Active Appearance Model (AAM) applied to the lip area of a face image. Furthermore, we introduce a combination weight between the audio and visual features and formulate a new cost function for estimating the audio-visual exemplars. By using the joint audio-visual features as source features, the VC performance is improved compared with the previous audio-input exemplar-based VC method. The effectiveness of this method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.

#19 Towards a practical silent speech recognition system

Authors: Yunbin Deng ; James T. Heaton ; Geoffrey S. Meltzner

Our recent efforts towards developing a practical surface electromyography (sEMG) based silent speech recognition interface have resulted in significant advances in the hardware, software and algorithmic components of the system. In this paper, we report our algorithmic progress, specifically: sEMG feature extraction parameter optimization, advances in sEMG acoustic modeling, and sEMG sensor set reduction. The key findings are: 1) the gold-standard parameters for acoustic speech feature extraction are far from optimum for sEMG parameterization, 2) advances in state-of-the-art speech modelling can be leveraged to significantly enhance the continuous sEMG silent speech recognition accuracy, and 3) the number of sEMG sensors can be reduced by half with little impact on the final recognition accuracy, and the optimum sensor subset can be selected efficiently based on basic mono-phone HMM modeling.

#20 Enhancing multimodal silent speech interfaces with feature selection

Authors: João Freitas ; Artur Ferreira ; Mário Figueiredo ; António Teixeira ; Miguel Sales Dias

In research on Silent Speech Interfaces (SSI), different sources of information (modalities) have been combined with the aim of obtaining better performance than with the individual modalities. However, when combining these modalities, the dimensionality of the feature space rapidly increases, leading to the well-known “curse of dimensionality”. As a consequence, in order to extract useful information from these data, one has to resort to feature selection (FS) techniques to lower the dimensionality of the learning space. In this paper, we assess the impact of FS techniques for silent speech data, in a dataset with four non-invasive and promising modalities: video, depth, ultrasonic Doppler sensing, and surface electromyography. We consider two supervised (mutual information and Fisher's ratio) and two unsupervised (mean-median and arithmetic mean geometric mean) FS filters. The evaluation assesses the classification accuracy (word recognition error) of three well-known classifiers (k-nearest neighbors, support vector machines, and dynamic time warping). The key results of this study show that both unsupervised and supervised FS techniques improve the classification accuracy on both individual and combined modalities. For instance, on the video component we attain a 36.2% relative gain in error rate. FS is also useful as pre-processing for feature fusion.
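
One of the supervised filters mentioned above, mutual information, can be sketched as a simple filter-based selection step before classification; the synthetic data stands in for the concatenated multimodal feature vectors, and the specific feature counts are assumptions.

```python
# Sketch: mutual-information feature selection followed by a k-NN classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=300, n_informative=20,
                           random_state=0)          # stand-in multimodal features

selector = SelectKBest(mutual_info_classif, k=50)   # keep the 50 best-scoring features
X_sel = selector.fit_transform(X, y)

knn = KNeighborsClassifier(n_neighbors=3)
print("all features :", cross_val_score(knn, X, y, cv=5).mean())
print("selected 50  :", cross_val_score(knn, X_sel, y, cv=5).mean())
```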

#21 Opti-speech: a real-time, 3d visual feedback system for speech training

Authors: William Katz ; Thomas F. Campbell ; Jun Wang ; Eric Farrar ; J. Coleman Eubanks ; Arvind Balasubramanian ; Balakrishnan Prabhakaran ; Rob Rennaker

We describe an interactive 3D system to provide talkers with real-time information concerning their tongue and jaw movements during speech. Speech movement is tracked by a magnetometer system (Wave; NDI, Waterloo, Ontario, Canada). A customized interface allows users to view their current tongue position (represented as an avatar consisting of flesh-point markers and a modeled surface) placed in a synchronously moving, transparent head. Subjects receive augmented visual feedback when tongue sensors achieve the correct place of articulation. Preliminary data obtained for a group of adult talkers suggest this system can be used to reliably provide real-time feedback for American English consonant place of articulation targets. Future studies, including tests with communication disordered subjects, are described.

#22 Across-speaker articulatory normalization for speaker-independent silent speech recognition

Authors: Jun Wang ; Ashok Samal ; Jordan R. Green

Silent speech interfaces (SSIs), which recognize speech from articulatory information (i.e., without using audio information), have the potential to enable persons with laryngectomy or a neurological disease to produce synthesized speech with a natural-sounding voice using their tongue and lips. Current approaches to SSIs have largely relied on speaker-dependent recognition models to minimize the negative effects of talker variation on recognition accuracy. Speaker-independent approaches are needed to reduce the large amount of training data required from each user; often only limited articulatory samples are available from persons with moderate to severe speech impairments, due to the logistical difficulty of data collection. This paper reports an across-speaker articulatory normalization approach based on Procrustes matching, a bidimensional regression technique for removing translational, scaling, and rotational effects from spatial data. A dataset of short functional sentences was collected from seven English talkers. A support vector machine was then trained to classify sentences based on normalized tongue and lip movements. Speaker-independent classification accuracy (tested with leave-one-subject-out cross-validation) improved significantly after normalization, from 68.63% to 95.90%. These results support the feasibility of a speaker-independent SSI that uses Procrustes matching as the basis for articulatory normalization across speakers.
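
Procrustes matching of this kind can be sketched with SciPy: each speaker's set of 2-D articulator positions is aligned (translation, scaling, rotation) to a reference configuration before classification. The synthetic point sets below stand in for tongue and lip flesh-point coordinates.

```python
# Sketch: align one speaker's articulatory point set to a reference speaker's
# using ordinary Procrustes analysis; the residual disparity should be near zero
# when the two differ only by translation, scaling, and rotation.
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
reference = rng.standard_normal((50, 2))            # reference speaker's points

# Simulate another speaker: rotated, scaled, shifted copy of the reference.
theta = np.deg2rad(25)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
other = (1.7 * reference @ R.T + np.array([4.0, -2.0])
         + 0.01 * rng.standard_normal((50, 2)))

ref_norm, other_norm, disparity = procrustes(reference, other)
print("residual disparity after normalization:", disparity)
```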

#23 Conversion from facial myoelectric signals to speech: a unit selection approach

Authors: Marlene Zahner ; Matthias Janke ; Michael Wand ; Tanja Schultz

This paper reports on our recent research on surface electromyographic (EMG) speech synthesis: the direct conversion of EMG signals of the articulatory muscle movements into an acoustic speech signal. In this work we introduce a unit selection approach that compares segments of the input EMG signal to a database of simultaneously recorded EMG/audio unit pairs and selects the best-matching audio units based on target and concatenation costs; these units are then concatenated to synthesize the acoustic speech output. We show that this approach can generate proper speech output from the input EMG signal. We evaluate different properties of the units and investigate how much data is necessary for an initial transformation. Prior work on EMG-to-speech conversion used a frame-based approach from the voice conversion domain, which struggles to generate a natural F0 contour; this problem may also be tackled by our unit selection approach.

#24 Towards real-life application of EMG-based speech recognition by using unsupervised adaptation

Authors: Michael Wand ; Tanja Schultz

This paper deals with a Silent Speech Interface based on surface electromyography (EMG), in which electrodes capture the electrical activity generated by the articulatory muscles of a user's face in order to decode the underlying speech, allowing speech to be recognized even when no sound is heard or produced. So far, most EMG-based speech recognizers described in the literature do not allow electrode reattachment between system training and usage, which we consider unsuitable for practical applications. In this study we report on our research on unsupervised session adaptation: a system is pre-trained with data from multiple recording sessions and then adapted to the current recording session using data that accrues during normal use, without requiring a time-consuming enrollment phase. We show that considerable accuracy improvements can be achieved with this method, paving the way towards real-life applications of the technology.

#25 Simple gesture-based error correction interface for smartphone speech recognition

Authors: Yuan Liang ; Koji Iwano ; Koichi Shinoda

Conventional error correction interfaces for speech recognition require the user to first mark an error region and then choose the correct word from a candidate list. Taking into account the user's effort and the limited user interface available on a smartphone, this operation should be simpler. In this paper, we propose an interface where the user marks the error region once and the word is replaced by another candidate. Assuming that the words preceding and succeeding the error region have been validated by the user, we search Web n-grams for long word sequences that match this context. The acoustic features of the error region are also used to rerank the candidate words. The experimental results demonstrate the effectiveness of our method: 30.2% of the error words were corrected by a single operation.
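
The correction idea, context-based candidate ranking combined with an acoustic score, can be sketched with a toy example; the tiny in-memory "Web n-gram" table, the candidate words, the acoustic scores, and the interpolation weight are all illustrative stand-ins.

```python
# Toy sketch: rank replacement candidates for a marked error word by how often
# they appear between the validated left/right context words, blended with an
# acoustic similarity score for the error region.
trigram_counts = {
    ("recognize", "speech", "today"): 120,
    ("recognize", "beach", "today"): 3,
}

def rank_candidates(left, right, candidates, acoustic_scores, lm_weight=0.7):
    ranked = []
    for word in candidates:
        lm = trigram_counts.get((left, word, right), 0)
        score = lm_weight * lm + (1 - lm_weight) * acoustic_scores.get(word, 0.0)
        ranked.append((score, word))
    return [w for _, w in sorted(ranked, reverse=True)]

if __name__ == "__main__":
    # The recognizer output "recognize beach today" is fixed by one tap on "beach".
    print(rank_candidates("recognize", "today",
                          candidates=["speech", "beach"],
                          acoustic_scores={"speech": 0.8, "beach": 0.9}))
```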